This is my R Markdown output site for the course exercises of the Open Data Science MOOC. My repository for this course can be found here
I'm new to version control with GitHub, so let's see if this becomes a habit. :)
Done: * Registered to GitHub, installed it and pushed my first files * Committed and pushed updated files to Git
This data consists of seven variables and 166 observations. The data were collected in 2014 (Vehkalahti, 2014) from students attending a statistics course in the social sciences. The survey assessed attitudes towards statistics, individual learning styles, and the points achieved in the course by each individual. In the data used here, the items have been aggregated into composites (Attitude, deep, stra, surf) and divided by the number of items. Further, rows where Points = 0 were removed from the dataset (original N = 183; after removing the zeros, N = 166). Find my R file for the data wrangling here.
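As an illustration only (the actual script is in the linked R file), the aggregation and filtering steps can be sketched on a toy data frame with made-up item names:

```r
# Toy sketch of the wrangling described above; D03 and D11 are
# hypothetical questionnaire items, not the real variable names.
raw <- data.frame(
  D03    = c(4, 5, 2),   # item rated 1-5
  D11    = c(3, 4, 1),   # item rated 1-5
  Points = c(25, 0, 18)  # exam points; one zero row to drop
)

# Aggregate the items and divide by the number of items (here 2)
raw$deep <- (raw$D03 + raw$D11) / 2

# Remove observations with zero course points
lrn <- raw[raw$Points > 0, ]
nrow(lrn)  # 2 rows remain
```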
More information about the dataset at hand can be found here
Kimmo Vehkalahti: ASSIST 2014 - Phase 3 (end of Part 2), N=183 Course: Johdatus yhteiskuntatilastotieteeseen, syksy 2014 (Introduction to Social Statistics, fall 2014 - in Finnish), international survey of Approaches to Learning, made possible by Teachers’ Academy funding for KV in 2013-2015.
setwd("C:/Users/rekar/Documents/Road to PhD/Kurssit/Open data science/IODS-project/data")
lrn <- read.csv("lrn.csv")
library(tidyverse)
library(viridis)
#Rename gender variable for plotting purposes
lrn <-
lrn %>%
mutate(gender=recode(gender, "M"="Male", "F"="Female"))
head(lrn)
## gender Age Attitude deep stra surf Points
## 1 Female 53 3.7 3.583333 3.375 2.583333 25
## 2 Male 55 3.1 2.916667 2.750 3.166667 12
## 3 Female 49 2.5 3.500000 3.625 2.250000 24
## 4 Male 53 3.5 3.500000 3.125 2.250000 10
## 5 Male 49 3.7 3.666667 3.625 2.833333 22
## 6 Female 38 3.8 4.750000 3.625 2.416667 21
Below you can find a summary of the data produced by describe from the psych package. After this I examined the data visually using ggpairs. I then drew two plots to show the distributions more clearly and fitted a few geom_smooths for fun. I tried to add the viridis colouring to the ggpairs plot, but it was not applied to all of the panels. If you, dear student peer-reviewer, have a solution to this, please share it in the peer review.
First we take a look at the distribution of age by gender, and we see that the males attending this course were older than the females. Second we look at the distribution of attitudes towards statistics, where we find both the mean (marked with an asterisk) and the median being higher among males. Third, I was interested in whether age had any association with the points acquired, for which a loess smooth was fitted to the data.
After this I plotted a few scatter plots with geom_smooths (OLS regressions) to assess the relationships among the variables. We find a clear relation between attitude and points earned in the course. The relation of attitude to the learning styles was examined next; visually this relationship seems to be present only among males.
Further, the relation of the learning approaches to acquired course points was assessed. Here we find an association of the learning approaches with points earned, with high surface learning associated with lower course points. However, the CI bands are quite wide.
#Using the psych library for extensive summary
library(psych)
library(GGally)
describe(lrn)
## vars n mean sd median trimmed mad min max range skew
## gender* 1 166 1.34 0.47 1.00 1.30 0.00 1.00 2.00 1.00 0.68
## Age 2 166 25.51 7.77 22.00 23.99 2.97 17.00 55.00 38.00 1.89
## Attitude 3 166 3.14 0.73 3.20 3.15 0.74 1.40 5.00 3.60 -0.08
## deep 4 166 3.68 0.55 3.67 3.70 0.62 1.58 4.92 3.33 -0.50
## stra 5 166 3.12 0.77 3.19 3.14 0.83 1.25 5.00 3.75 -0.11
## surf 6 166 2.79 0.53 2.83 2.78 0.62 1.58 4.33 2.75 0.14
## Points 7 166 22.72 5.89 23.00 23.08 5.93 7.00 33.00 26.00 -0.40
## kurtosis se
## gender* -1.54 0.04
## Age 3.24 0.60
## Attitude -0.48 0.06
## deep 0.66 0.04
## stra -0.45 0.06
## surf -0.27 0.04
## Points -0.26 0.46
ggpairs(lrn, mapping = aes(col=gender, alpha=0.2), lower=list(combo = wrap("facethist", bins =20))) + scale_colour_viridis_d(option = "H")
#Age * sex
lrn %>%
ggplot(aes(gender, Age, colour=gender)) +
geom_violin() +
scale_colour_viridis_d(option="H") +
geom_boxplot(width=0.2, size=0.2, color="black", alpha=0.4, outlier.size = 0) +
stat_summary(fun=mean, geom="point", size=1, color="black", shape=8) +
theme_bw()
#Attitude * Sex
lrn %>%
ggplot(aes(gender, Attitude, color=gender)) +
geom_violin() +
scale_colour_viridis_d(option="H") +
geom_boxplot(width=0.2, size=0.2, color="black", alpha=0.4, outlier.size = 0) +
stat_summary(fun=mean, geom="point", size=1, color="black", shape=8) +
theme_bw()
#Age's relation to points
lrn %>%
ggplot(aes(Age, Points, color=gender)) +
geom_point(alpha=0.2) +
geom_smooth(method=loess) +
scale_colour_viridis_d(option="H") +
theme_bw()
#Attitude * points
lrn %>%
ggplot(aes(Attitude, Points, color=gender)) +
geom_point(alpha=0.2) +
geom_smooth(method=lm) +
scale_colour_viridis_d(option="H") +
theme_bw()
#Attitude * learning style deep
lrn %>%
ggplot(aes(Attitude, deep, color=gender)) +
geom_point(alpha=0.2) +
geom_smooth(method=lm) +
scale_colour_viridis_d(option="H") +
theme_bw()
#Attitude * learning strategy
lrn %>%
ggplot(aes(Attitude, stra, color=gender)) +
geom_point(alpha=0.2) +
geom_smooth(method=lm) +
scale_colour_viridis_d(option="H") +
theme_bw()
#Attitude * surface learning style
lrn %>%
ggplot(aes(Attitude, surf, color=gender)) +
geom_point(alpha=0.2) +
geom_smooth(method=lm) +
scale_colour_viridis_d(option="H") +
theme_bw()
#Learning style deep * points
lrn %>%
ggplot(aes(deep, Points, color=gender)) +
geom_point(alpha=0.2) +
geom_smooth(method=lm) +
scale_colour_viridis_d(option="H") +
theme_bw()
#Learning strategy * points
lrn %>%
ggplot(aes(stra, Points, color=gender)) +
geom_point(alpha=0.2) +
geom_smooth(method=lm) +
scale_colour_viridis_d(option="H") +
theme_bw()
#Surface learning * points
lrn %>%
ggplot(aes(surf, Points, color=gender)) +
geom_point(alpha=0.2) +
geom_smooth(method=lm) +
scale_colour_viridis_d(option="H") +
theme_bw()
For this exercise we were asked to fit three IVs to model their relationship to the points earned, the DV of the model, and to remove the non-significant IVs in the final model.
I will fit the model (OLS regression) with attitude, learning strategy (stra) and surface learning (surf) as the IVs, as associations of these were observed in the above visualisations.
Below we can see the summaries of lfit (with all three of the aforementioned IVs) and lfit2 (attitude as the only IV). We find that attitude has quite a strong association with the points acquired (B = 3.39, p < .001). Surface learning is associated negatively with the outcome and learning strategy positively, but these two are far from significant. Surface learning and stra were not significantly associated with Points as single predictors either. The multiple R-squared is about .02 higher in lfit, so the added predictors do account for a tiny bit of extra explained variance. However, taking into account the non-significance of these predictors and that the adjusted R-squared changes by less than .01, we can be happy with attitude explaining 19% of the variance in acquired course points.
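As a quick arithmetic check on the variance-explained claims, using the R-squared values printed in the two model summaries:

```r
# R^2 values copied from the model summaries in this section
r2_lfit  <- 0.2074   # lfit: Attitude + surf + stra
r2_lfit2 <- 0.1906   # lfit2: Attitude only
round(r2_lfit - r2_lfit2, 3)   # 0.017

# Adjusted R^2 change is even smaller
round(0.1927 - 0.1856, 3)      # 0.007
```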
lfit <- lm(Points ~ Attitude + surf + stra, data=lrn)
summary(lfit)
##
## Call:
## lm(formula = Points ~ Attitude + surf + stra, data = lrn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## Attitude 3.3952 0.5741 5.913 1.93e-08 ***
## surf -0.5861 0.8014 -0.731 0.46563
## stra 0.8531 0.5416 1.575 0.11716
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
#Removing non-significant IVs
lfit2 <- lm(Points ~ Attitude, data=lrn)
summary(lfit2)
##
## Call:
## lm(formula = Points ~ Attitude, data = lrn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## Attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The residuals vs fitted plot shows that the assumption of a linear association is reasonable. The Q-Q plot supports the assumption of normality, although a few outliers emerge. According to these diagnostics, the linear model Points ~ Attitude is reasonably valid.
par(mfrow = c(2,2))
plot(lfit2, which=c(1,2,5))
setwd("C:/Users/rekar/Documents/Road to PhD/Kurssit/Open data science/IODS-project/data")
pormath <- read.csv("pormath.csv")
library(dplyr)
library(tidyverse)
library(GGally)
library(viridis)
library(sjPlot)
library(cowplot)
In this exercise, the data assessed are retrieved from the UCI Machine Learning Repository, collected by Paulo Cortez. They comprise two data sets on secondary school students from Portuguese schools, combining school reports and questionnaires. More information about the data set used in this exercise can be found here. The two separate data sets concern performance in mathematics and in Portuguese, respectively. Only the participants present in both data sets are included in the combined data used in this exercise; the others were excluded. Hence the total N of observations is 370.
colnames(pormath)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "n" "id.p" "id.m"
## [31] "failures" "paid" "absences" "G1" "G2"
## [36] "G3" "failures.p" "paid.p" "absences.p" "G1.p"
## [41] "G2.p" "G3.p" "failures.m" "paid.m" "absences.m"
## [46] "G1.m" "G2.m" "G3.m" "alc_use" "high_use"
## [51] "cid"
The association of the following variables with high alcohol consumption will be assessed. High alcohol consumption is defined here as alc_use > 2, where alc_use is the mean of alcohol consumption during weekdays and alcohol consumption during weekends (each rated from 1 = very low to 5 = very high).
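For clarity, the derivation of high_use can be sketched on a few made-up values (Dalc and Walc are the weekday and weekend consumption columns in the data):

```r
# Made-up weekday (Dalc) and weekend (Walc) ratings, each on a 1-5 scale
Dalc <- c(1, 3, 5)
Walc <- c(2, 4, 5)

alc_use  <- (Dalc + Walc) / 2   # average consumption
high_use <- alc_use > 2         # TRUE = high consumption
high_use                        # FALSE TRUE TRUE
```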
In the following arguments for choosing these variables, I won't be citing any research, since that is out of the scope of the current exercise. However, to present working hypotheses, I briefly argue why I expect certain outcomes. Furthermore, I have not familiarised myself with the socio-cultural or contextual factors that should be taken into account when assessing the following variables.
First, parental cohabitation (Pstatus) is an interesting variable, since co-parenting is often associated with better all-round well-being in children and adolescents. I expect cohabiting parenthood to be associated with lower levels of alcohol consumption. Second, I find it interesting to assess whether attending nursery school has an association with alcohol consumption. As I am not familiar with the nursery system in Portugal, I can only take a wild guess and expect that those who attended nursery school (i.e., pre-school) were better prepared to start primary school, and therefore experienced fewer hardships at the very start of their educational path. Third, I assess the association of the number of past class failures with high alcohol consumption. I assume that adolescents lacking the resources (social, emotional or instrumental) to cope with failures may develop self-handicapping strategies, which may then turn into a vicious cycle; hence the higher the number of failures, the higher the probability of high alcohol consumption. Finally, I expect perceived family relations (famrel) to also have an association with the alcohol consumption outcome: the better the family relations, the lower the probability of high alcohol consumption.
### Select the variables into a new DF ####
pormath_s <- pormath %>%
select(Pstatus, nursery, failures, famrel, sex, age, high_use)
### Change logical high_use to numerical ####
#pormath_s$high_use <- as.numeric(pormath_s$high_use) # 1 = True, 0 = False
### Distributions with the discrete variable (alc_use) ####
p1 <- (pormath %>%
ggplot(aes(Pstatus, alc_use)) +
see::geom_violinhalf() +
geom_boxplot(width=0.2, size=0.2, alpha=0.4) +
stat_summary(fun=mean, geom="point", size=1, color="black", shape=8) +
stat_summary(fun.data = mean_se, geom = "errorbar", width=0.1, size=0.5) +
theme_bw() +
ylab("Alcohol use") +
xlab("Parents Apart (A) or Together (T)?")
)
p2 <- (pormath %>%
ggplot(aes(nursery, alc_use)) +
see::geom_violinhalf() +
geom_boxplot(width=0.2, size=0.2, alpha=0.4) +
stat_summary(fun=mean, geom="point", size=1, color="black", shape=8) +
stat_summary(fun.data = mean_se, geom = "errorbar", width=0.1, size=0.5) +
theme_bw() +
ylab("Alcohol use") +
xlab("Attended nursery school?")
)
p3 <- (pormath %>%
ggplot(aes(failures, alc_use, color=alc_use)) +
stat_summary(fun=mean, geom="point", size=1, color="black", shape=8) +
stat_summary(fun.data = mean_se, geom = "errorbar", width=0.1, size=0.5) +
theme_bw() +
ylab("Alcohol use") +
xlab("Number of past Class failures")
)
p4 <- (pormath %>%
ggplot(aes(famrel, alc_use)) +
stat_summary(fun=mean, geom="point", size=1, color="black", shape=8) +
stat_summary(fun.data = mean_se, geom = "errorbar", width=0.1, size=0.5) +
theme_bw() +
ylab("Alcohol use") +
xlab("Familial relations \n (1 = very bad, 5 = excellent)")
)
plot_grid(p1, p2, p3, p4)
### Cross-tabulations ####
tab_xtab(var.row = pormath_s$Pstatus, var.col = pormath_s$high_use, title = "Parents Cohabiting?", show.row.prc = TRUE)
| Pstatus | high_use = FALSE | high_use = TRUE | Total |
|---|---|---|---|
| A | 26 (68.4 %) | 12 (31.6 %) | 38 (100 %) |
| T | 233 (70.2 %) | 99 (29.8 %) | 332 (100 %) |
| Total | 259 (70 %) | 111 (30 %) | 370 (100 %) |

χ²=0.001 · df=1 · φ=0.012 · p=0.970
tab_xtab(var.row = pormath_s$nursery, var.col = pormath_s$high_use, title = "Attended to Nursery School?", show.row.prc = TRUE)
| nursery | high_use = FALSE | high_use = TRUE | Total |
|---|---|---|---|
| no | 46 (63.9 %) | 26 (36.1 %) | 72 (100 %) |
| yes | 213 (71.5 %) | 85 (28.5 %) | 298 (100 %) |
| Total | 259 (70 %) | 111 (30 %) | 370 (100 %) |

χ²=1.249 · df=1 · φ=0.066 · p=0.264
tab_xtab(var.row = pormath_s$failures, var.col = pormath_s$high_use, title = "Number of past Class failures", show.row.prc = TRUE)
| failures | high_use = FALSE | high_use = TRUE | Total |
|---|---|---|---|
| 0 | 238 (73.2 %) | 87 (26.8 %) | 325 (100 %) |
| 1 | 12 (50 %) | 12 (50 %) | 24 (100 %) |
| 2 | 8 (47.1 %) | 9 (52.9 %) | 17 (100 %) |
| 3 | 1 (25 %) | 3 (75 %) | 4 (100 %) |
| Total | 259 (70 %) | 111 (30 %) | 370 (100 %) |

χ²=14.304 · df=3 · Cramer's V=0.197 · Fisher's p=0.004
tab_xtab(var.row = pormath_s$famrel, var.col = pormath_s$high_use, title = "Familial relations", show.row.prc = TRUE)
| famrel | high_use = FALSE | high_use = TRUE | Total |
|---|---|---|---|
| 1 | 6 (75 %) | 2 (25 %) | 8 (100 %) |
| 2 | 9 (50 %) | 9 (50 %) | 18 (100 %) |
| 3 | 39 (60.9 %) | 25 (39.1 %) | 64 (100 %) |
| 4 | 128 (71.1 %) | 52 (28.9 %) | 180 (100 %) |
| 5 | 77 (77 %) | 23 (23 %) | 100 (100 %) |
| Total | 259 (70 %) | 111 (30 %) | 370 (100 %) |

χ²=8.466 · df=4 · Cramer's V=0.151 · Fisher's p=0.081
Of the expectations, only two of the variables had a statistically significant relationship with high alcohol consumption: the number of past class failures with a positive association (95 % CI for OR = 1.31–2.86) and family relations with a negative association (95 % CI for OR = 0.60–0.98). The other two variables had no statistically significant relationship with high alcohol consumption.
m1 <- glm(high_use ~ Pstatus + nursery + failures + famrel, family="binomial", data=pormath_s)
jtools::summ(m1)
## MODEL INFO:
## Observations: 370
## Dependent Variable: high_use
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(4) = 18.16, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.07
## Pseudo-R² (McFadden) = 0.04
## AIC = 443.88, BIC = 463.45
##
## Standard errors: MLE
## ------------------------------------------------
## Est. S.E. z val. p
## ----------------- ------- ------ -------- ------
## (Intercept) 0.39 0.64 0.61 0.54
## PstatusT -0.10 0.38 -0.27 0.79
## nurseryyes -0.31 0.29 -1.08 0.28
## failures 0.65 0.20 3.29 0.00
## famrel -0.27 0.12 -2.14 0.03
## ------------------------------------------------
OR1 <- coef(m1) %>% exp
CI1 <- confint(m1) %>% exp
cbind(OR1, CI1)
## OR1 2.5 % 97.5 %
## (Intercept) 1.4764105 0.4176121 5.1647164
## PstatusT 0.9038712 0.4366755 1.9686682
## nurseryyes 0.7345732 0.4215456 1.3009277
## failures 1.9163856 1.3101788 2.8641095
## famrel 0.7654375 0.5985101 0.9775722
Here I further examine the predictive power of the model fitted above. The predictive power is fair, as it predicts about 30% of the outcomes incorrectly.
# predict() the probability of high_use
probabilities <- predict(m1, type = "response")
# add the predicted probabilities to 'alc'
pormath_s <- mutate(pormath_s, probability = probabilities)
# use the probabilities to make a prediction of high_use
pormath_s <- mutate(pormath_s, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = pormath_s$high_use, prediction = pormath_s$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 248 11
## TRUE 99 12
ggplot(pormath_s, aes(probability, high_use, col=prediction)) +
geom_point()
table(high_use = pormath_s$high_use, prediction = pormath_s$prediction) %>% prop.table %>% addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67027027 0.02972973 0.70000000
## TRUE 0.26756757 0.03243243 0.30000000
## Sum 0.93783784 0.06216216 1.00000000
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the proportion of wrong predictions in the (training) data
loss_func(class = pormath_s$high_use, prob = pormath_s$probability)
## [1] 0.2972973
The ten-fold cross-validation shows the predictive performance of this model on held-out folds (about 31% wrong) to be close to its performance on the whole data (30% wrong).
loss_func(class = pormath_s$high_use, prob = pormath_s$probability)
## [1] 0.2972973
cv <- boot::cv.glm(data = pormath_s, cost = loss_func, glmfit = m1, K = 10)
cv$delta
## [1] 0.3135135 0.3129730
Here I compare two alternative models to the m1 tested above. Before evaluating predictive power further, I examine these models by the change in BIC and pseudo-R². According to the BIC and R² measures, the third model below is the best of the three.
Further, the predictive power of m3 (cross-validated error rate about 22%) exceeds that of the first model tested above (m1, about 30% error rate), and also exceeds the predictive power of the model introduced in the course DataCamp exercises (about 26% error rate).
According to this model (m3, the last summary below), males had a higher probability of high alcohol consumption than their female counterparts (95 % CI for OR = 1.66–4.76). Second, adolescents who went out with friends more often were also more likely to exceed our cut-off point for high alcohol consumption (95 % CI for OR = 1.68–2.77). Third, adolescents living in urban areas were less likely to do so than their counterparts in rural areas (95 % CI for OR = 0.28–0.97). Fourth, adolescents who had failed classes were more likely to do so (95 % CI for OR = 0.95–2.37); note, however, that this interval crosses 1, and that predicting alcohol consumption from past failures in this data rests on very few observations (see the class counts in the cross-tabulations above). Fifth, adolescents with good family relations were less likely to do so (95 % CI for OR = 0.50–0.89). Finally, adolescents with more absences were also more likely to consume more alcohol, although the OR is quite modest (95 % CI for OR = 1.04–1.13).
jtools::summ(m1)
## MODEL INFO:
## Observations: 370
## Dependent Variable: high_use
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(4) = 18.16, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.07
## Pseudo-R² (McFadden) = 0.04
## AIC = 443.88, BIC = 463.45
##
## Standard errors: MLE
## ------------------------------------------------
## Est. S.E. z val. p
## ----------------- ------- ------ -------- ------
## (Intercept) 0.39 0.64 0.61 0.54
## PstatusT -0.10 0.38 -0.27 0.79
## nurseryyes -0.31 0.29 -1.08 0.28
## failures 0.65 0.20 3.29 0.00
## famrel -0.27 0.12 -2.14 0.03
## ------------------------------------------------
jtools::summ(glm(high_use ~ failures + sex + address + famrel + freetime + goout + internet, family="binomial", data=pormath)) #freetime and internet access not significant; removing them.
## MODEL INFO:
## Observations: 370
## Dependent Variable: high_use
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(7) = 83.08, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.29
## Pseudo-R² (McFadden) = 0.18
## AIC = 384.96, BIC = 416.27
##
## Standard errors: MLE
## ------------------------------------------------
## Est. S.E. z val. p
## ----------------- ------- ------ -------- ------
## (Intercept) -2.23 0.77 -2.89 0.00
## failures 0.45 0.23 1.99 0.05
## sexM 0.89 0.26 3.38 0.00
## addressU -0.70 0.32 -2.22 0.03
## famrel -0.43 0.14 -3.00 0.00
## freetime 0.11 0.14 0.75 0.46
## goout 0.76 0.13 5.90 0.00
## internetyes 0.26 0.38 0.69 0.49
## ------------------------------------------------
jtools::summ(glm(high_use ~ failures + sex + address + famrel + goout + absences, family="binomial", data=pormath))
## MODEL INFO:
## Observations: 370
## Dependent Variable: high_use
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(6) = 94.64, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.32
## Pseudo-R² (McFadden) = 0.21
## AIC = 371.39, BIC = 398.79
##
## Standard errors: MLE
## ------------------------------------------------
## Est. S.E. z val. p
## ----------------- ------- ------ -------- ------
## (Intercept) -2.27 0.71 -3.22 0.00
## failures 0.40 0.23 1.73 0.08
## sexM 1.03 0.27 3.82 0.00
## addressU -0.65 0.31 -2.07 0.04
## famrel -0.40 0.15 -2.79 0.01
## goout 0.76 0.13 6.02 0.00
## absences 0.08 0.02 3.54 0.00
## ------------------------------------------------
m3 <- glm(high_use ~ failures + sex + address + famrel + goout + absences, family="binomial", data=pormath)
exp(cbind(OR = coef(m3), confint(m3)))
## OR 2.5 % 97.5 %
## (Intercept) 0.1030586 0.0248787 0.4001095
## failures 1.4913178 0.9538249 2.3728914
## sexM 2.7874285 1.6580283 4.7602275
## addressU 0.5238851 0.2836255 0.9691226
## famrel 0.6672113 0.5004071 0.8856192
## goout 2.1448371 1.6844140 2.7725681
## absences 1.0816926 1.0365062 1.1324925
# predict() the probability of high_use
probabilities3 <- predict(m3, type = "response")
# add the predicted probabilities
pormath <- mutate(pormath, probability = probabilities3)
# use the probabilities to make a prediction of high_use
pormath <- mutate(pormath, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = pormath$high_use, prediction = pormath$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 237 22
## TRUE 56 55
ggplot(pormath, aes(probability, high_use, col=prediction)) +
geom_point()
table(high_use = pormath$high_use, prediction = pormath$prediction) %>% prop.table %>% addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.64054054 0.05945946 0.70000000
## TRUE 0.15135135 0.14864865 0.30000000
## Sum 0.79189189 0.20810811 1.00000000
loss_func1 <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func1 to compute the proportion of wrong predictions in the (training) data
loss_func1(class = pormath$high_use, prob = pormath$probability)
## [1] 0.2108108
#cross-validate
cv2 <- boot::cv.glm(data = pormath, cost = loss_func, glmfit = m3, K = 10)
#print error-rate of cross-validation
cv2$delta
## [1] 0.2108108 0.2175676
In this exercise the data used is the Boston dataset included in the MASS package. It comprises 506 observations of 14 variables. The original rationale of the data was to study how housing prices relate to the NOx levels of the area (see Harrison & Rubinfeld, 1978). The 506 observations are census tracts, not districts as I first supposed from the structure of the data. Details of the variables can be found here
Further, the aim of this exercise is to find an appropriate number of clusters.
Below you find the summary of the variables, an overview of the data, density plots of the variables, and finally the correlation plot of the data.
From these figures we can see that accessibility to radial highways (rad) correlates strongly (r = .91) with the full-value property-tax rate (tax). The NOx levels seem to be negatively correlated with the distance to the Boston employment centers (r = -.77); in other words, the further from these centers, the lower the NOx levels. Further, it seems that the older the buildings, the higher the NOx levels (r = .73), which I assume might correspond to city-center areas. Eyeballing these figures, it could be argued that the data might fall into two clusters, corresponding to the shapes, peaks and polarisation seen in the density plots.
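One common heuristic for choosing the number of clusters is to compute the total within-cluster sum of squares (TWCSS) for a range of k and look for an elbow. A minimal sketch on the built-in iris measurements (only an illustration; the Boston data is scaled and examined below):

```r
# Elbow heuristic for k-means: TWCSS for k = 1..6 on scaled iris columns
data_m <- scale(iris[, 1:4])
set.seed(13)
twcss <- sapply(1:6, function(k) {
  kmeans(data_m, centers = k, nstart = 10)$tot.withinss
})
# look for the k where the drop in TWCSS levels off
plot(1:6, twcss, type = "b", xlab = "k", ylab = "TWCSS")
```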
library(tidyverse)
library(corrplot)
library(MASS)
library(viridis)
library(GGally)
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
pairs(Boston)
Boston %>%
gather(key=var_name, value = value) %>%
ggplot(aes(x=value)) +
geom_density() +
facet_wrap(~var_name, scales="free") + theme_bw()
cor_matrix <- cor(Boston) %>% round(digits = 2)
cor_matrix
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
corrplot(cor_matrix, method="color", type = "lower", cl.pos="b", tl.pos="d", tl.cex=0.7, cl.cex = 0.7, col = viridis(n=100), tl.col = "black" ,addCoef.col=T, number.cex=0.5)
First, the data will be scaled: (x - mean(x)) / sd(x). Then we create a categorical variable from the crim variable, with its quantiles as break points. Finally we remove the old crim variable and add the categorised crime variable to the data frame instead.
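As a quick sanity check, scale() applies exactly the column-wise transformation written above:

```r
# scale() standardises a column to (x - mean(x)) / sd(x)
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
auto   <- as.numeric(scale(x))
all.equal(manual, auto)  # TRUE
```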
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
Next the data will be divided into training and test sets.
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test<-boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
Next I run a linear discriminant analysis (LDA). The model identifies the high crime rate class rather well, although a few med_high observations are misclassified as high. Predictions in the low crime category are also good, but in the middle categories the classification could be better. Rad, i.e., accessibility to radial highways, is the strongest predictor in this model, with proportion of residential land (zn) and nox having the next largest coefficients.
# MASS and train are available
# linear discriminant analysis
lda.fit <- lda(crime ~., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2450495 0.2623762 0.2351485 0.2574257
##
## Group means:
## zn indus chas nox rm age
## low 0.90323503 -0.8715002 -0.113254311 -0.8607135 0.42128488 -0.8651032
## med_low -0.09851445 -0.3069298 -0.012331882 -0.5651516 -0.12170461 -0.3725274
## med_high -0.36808688 0.2030217 0.307875178 0.4492005 0.09210276 0.4648910
## high -0.48724019 1.0170690 -0.007331936 1.0304064 -0.40900621 0.8222461
## dis rad tax ptratio black lstat
## low 0.8754679 -0.6848939 -0.7359514 -0.3849110 0.38408246 -0.754440734
## med_low 0.3413700 -0.5528213 -0.4860997 -0.0448247 0.31807702 -0.161469445
## med_high -0.3942081 -0.4439051 -0.3292437 -0.4136518 0.08928928 0.004273273
## high -0.8622714 1.6386213 1.5144083 0.7813507 -0.77754282 0.941967561
## medv
## low 0.4744051
## med_low 0.0179738
## med_high 0.1707441
## high -0.6855334
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.08214054 0.610895126 -1.05567509
## indus 0.10277939 -0.209296174 0.13144696
## chas -0.02559332 -0.045857753 0.07050103
## nox 0.36088956 -0.840424750 -1.15361152
## rm 0.04602237 -0.073221642 -0.19725765
## age 0.23790223 -0.500084560 -0.25841301
## dis -0.04831354 -0.338026418 0.05476344
## rad 3.66543656 0.864133508 -0.35070277
## tax 0.04481980 0.002393311 0.69122518
## ptratio 0.09641129 0.111291050 -0.17136656
## black -0.06841505 0.029905550 0.09576248
## lstat 0.16318592 -0.138210729 0.46673774
## medv 0.09463775 -0.425788059 -0.03668518
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9583 0.0317 0.0100
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "blue", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2.2)
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 16 10 2 0
## med_low 2 12 6 0
## med_high 1 14 13 3
## high 0 0 0 23
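The overall accuracy of the predictions can be computed from the same cross-tabulation. A minimal sketch (the exact numbers will vary with the random train/test split):

```r
# Cross-tabulate correct classes against LDA predictions
conf <- table(correct = correct_classes, predicted = lda.pred$class)

# Proportion of correct predictions: sum of the diagonal over all test cases
accuracy <- sum(diag(conf)) / sum(conf)
accuracy
```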
Next I reload and scale the Boston data set and calculate the Euclidean distances between the observations.
data("Boston")
boston_scaled_k <- scale(Boston)
boston_scaled_k <- as.data.frame(boston_scaled_k)
summary(dist(boston_scaled_k))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
Next the data will be clustered with K-means. In the plot of the total within-cluster sum of squares below, the largest drop occurs when moving to two clusters. Therefore, a two-cluster solution is examined next.
kmax <- 10
set.seed(123)
totws <- sapply(1:kmax, function(k){kmeans(boston_scaled_k, k)$tot.withinss})
qplot(x = 1:kmax, y=totws, geom='line') + theme_bw() +
ylab("Total within sum of squares")
A two-cluster solution is plotted below. As anticipated, the variables that stood out in the earlier density plots also separate the two clusters. Non-retail business acres per town (indus) and property-tax rate (tax) separate very clearly, and nox levels also fall nicely into different clusters. The density plots are shown here as well for a scroll-free comparison.
kme <- kmeans(boston_scaled_k, centers = 2)
pairs(boston_scaled_k, col=kme$cluster)
boston_scaled_k %>%
gather(key=var_name, value = value) %>%
ggplot(aes(x=value)) +
geom_density() +
facet_wrap(~var_name, scales="free") + theme_bw()
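To back up the visual interpretation, the cluster centers (in standardized units) can be inspected directly from the kmeans object; variables such as indus, tax and nox should show clearly separated centers:

```r
# Cluster centers on the standardized scale; a large gap between the two rows
# of a column indicates a variable that separates the clusters
round(kme$centers, 2)

# Number of observations in each cluster
kme$size
```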
In this exercise, PCA and MCA are practiced with two different data sets.
For the PCA, the data used is the Human Development Index (HDI) data of the UNDP. The data in this exercise has observations from 155 countries on the eight variables shown below. Further information on the data set and the variables can be found here.
As the variables have been wrangled further from the originals, a short explanation of each: Edu2.FM is the ratio of females to males with at least secondary education, Labo.FM the ratio of female to male labour force participation, Edu.Exp the expected years of schooling, Life.Exp the life expectancy at birth, GNI the gross national income per capita, Mat.Mor the maternal mortality ratio, Ado.Birth the adolescent birth rate, and Parli.F the percentage of female representatives in parliament.
h <- read.csv('./data/human.csv', header= T, row.names = 1)
library(GGally)
library(corrplot)
library(tidyverse)
library(dplyr)
library(viridis)
str(h)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
Below you can find summaries and figures of the data at hand. We find that there are large discrepancies in the data set, i.e., large inequality between countries. From the summaries we see that, on average, more males than females complete secondary education, and males are also more often in the labor force. The expected years of education vary from 5 to 20 years, with both mean and median around 13 years. Life expectancy ranges from 49 to 83.5 years, maternal mortality from 1 to 1100 deaths per 100,000 live births (0.001% to 1.1%), and the adolescent birth rate from 0.6 to 204.8 (0.06% to 20.5%). Female parliamentary representation ranges from 0% to 57.5%.
From the correlations, we see that maternal mortality is strongly related to the adolescent birth rate (r = .76). Expected years of education correlate positively with life expectancy (r = .79) and GNI (r = .62), and negatively with both maternal mortality (r = -.74) and the adolescent birth rate (r = -.70), indicating that long expected education and high life expectancy are properties of countries with higher GNI.
summary(h)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
my_fn <- function(data, mapping, method="p", use="pairwise", ...){
# grab data
x <- eval_data_col(data, mapping$x)
y <- eval_data_col(data, mapping$y)
# calculate correlation
corr <- cor(x, y, method=method, use=use)
# calculate colour based on correlation value
# Here I have set a correlation of minus one to blue,
# zero to white, and one to red
# Change this to suit: possibly extend to add as an argument of `my_fn`
colFn <- colorRampPalette(c("blue", "white", "red"), interpolate ='spline')
fill <- colFn(100)[findInterval(corr, seq(-1, 1, length=100))]
ggally_cor(data = data, mapping = mapping, ...) +
theme_void() +
theme(panel.background = element_rect(fill=fill))
} # This function was written by user20650 on Stack Overflow:
# https://stackoverflow.com/questions/45873483/ggpairs-plot-with-heatmap-of-correlation-values
ggpairs(h,
upper = list(continuous = my_fn), lower=list(combo=wrap("facethist", binwidth=20, size=1)))
h %>%
gather(key=var_name, value = value) %>%
ggplot(aes(x=value)) +
geom_histogram() +
facet_wrap(~var_name, scales = "free_x") +
theme_bw()
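The corrplot library was loaded above but not used; a compact correlation heatmap of the same matrix could be drawn with it as, for example:

```r
# Correlation matrix of the human data
cor_h <- cor(h)

# Upper-triangle correlation plot; circle size and color encode r
corrplot(cor_h, method = "circle", type = "upper", tl.cex = 0.8)
```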
First we run PCA on the non-scaled data, which naturally doesn't make much sense, since the scale of GNI is through the roof compared to the other variables (see graph below): it implies that GNI alone explains practically 100% of the variance in the data. Therefore it is reasonable to scale the data before doing PCA, which follows.
pca_h <- prcomp(h)
s <- summary(pca_h)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
pca_pr1 <- round(100*s$importance[2,], digits = 1)
pca_pr1
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
pc_lab <- paste0(names(pca_pr1), " (", pca_pr1, "%)")
biplot(pca_h, choices = 1:2, cex = c(0.8, 1), col = c("slategray3", "royalblue3"), xlab = pc_lab[1], ylab = pc_lab[2])
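The dominance of GNI in the unscaled PCA is easy to verify from the raw variances, since prcomp on unscaled data is driven by whichever variable varies most:

```r
# Variances of the unscaled variables; GNI's variance is several orders
# of magnitude larger than any other, so it dominates the first component
round(sapply(h, var), 1)
```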
On the biplot below, the arrows correspond to the original variables. We see that Edu.Exp, Life.Exp, Edu2.FM and GNI correlate positively with each other, and negatively with Mat.Mor and Ado.Birth. The first component comprises these factors and explains 53.6% of the variance; the second component captures the female-male ratios in the labor force and parliament and explains 16.2% of the variance.
h_scaled <- scale(h)
pca_h_s <- prcomp(h_scaled)
s2 <- summary(pca_h_s)
s2
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
pca_pr2 <- round(100*s2$importance[2,], digits = 1)
pca_pr2
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
pc_lab2 <- paste0(names(pca_pr2), " (", pca_pr2, "%)")
biplot(pca_h_s, choices = 1:2, cex = c(0.8, 1), col = c("slategray3", "royalblue3"), xlab = pc_lab2[1], ylab = pc_lab2[2])
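The interpretation of the components can also be checked numerically from the loadings (the rotation matrix of the prcomp object):

```r
# Loadings of the first two principal components of the scaled data:
# PC1 should weight Edu.Exp, Life.Exp, GNI, Mat.Mor and Ado.Birth,
# PC2 the female-male ratios Labo.FM and Parli.F
round(pca_h_s$rotation[, 1:2], 2)
```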
The tea data set from the FactoMineR library has 300 observations of 36 variables and concerns tea-drinking preferences and habits. It is worth noting that there is not much documentation available on this data set, so some of my assumptions are pure guesses at best.
First we do an MCA on the whole data, excluding the variable age as it is not a factor. As the resulting plot doesn't make much sense, I then chose a few variables that I assumed to correspond to different dimensions.
I chose the variables Sport, diuretic, frequency, where, how, sugar, slimming, relaxing, effect.on.health, feminine, sex and age_Q.
With these factors, the variance explained by dimension 1 is enhanced. On dimension one, the factors correspond to being young or old, being a sportsman or not, and the tea being diuretic or not; in a sense, a correspondence to lifestyles can be seen here. The second dimension separates those who prefer quality/speciality teas from those who prefer big brands: those who buy their tea unpackaged from tea shops versus those who buy tea bags from chain stores.
Further, with the plotellipses function it is possible to draw confidence ellipses. There we can confirm that the type of tea and where it is bought separate the groups most clearly.
library(FactoMineR)
data("tea")
str(tea) #drop age, since it is not a factor
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
#ggpairs(tea)
tea_s <- subset(tea, select= -age) #drop age from df
mca <- MCA(tea_s, graph=F)
summary(mca)
##
## Call:
## MCA(X = tea_s, graph = F)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.090 0.082 0.070 0.063 0.056 0.053 0.050
## % of var. 5.838 5.292 4.551 4.057 3.616 3.465 3.272
## Cumulative % of var. 5.838 11.130 15.681 19.738 23.354 26.819 30.091
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
## Variance 0.048 0.047 0.044 0.041 0.040 0.039 0.037
## % of var. 3.090 3.053 2.834 2.643 2.623 2.531 2.388
## Cumulative % of var. 33.181 36.234 39.068 41.711 44.334 46.865 49.252
## Dim.15 Dim.16 Dim.17 Dim.18 Dim.19 Dim.20 Dim.21
## Variance 0.036 0.035 0.034 0.032 0.031 0.031 0.030
## % of var. 2.302 2.275 2.172 2.085 2.013 2.011 1.915
## Cumulative % of var. 51.554 53.829 56.000 58.086 60.099 62.110 64.025
## Dim.22 Dim.23 Dim.24 Dim.25 Dim.26 Dim.27 Dim.28
## Variance 0.028 0.027 0.026 0.025 0.025 0.024 0.024
## % of var. 1.847 1.740 1.686 1.638 1.609 1.571 1.524
## Cumulative % of var. 65.872 67.611 69.297 70.935 72.544 74.115 75.639
## Dim.29 Dim.30 Dim.31 Dim.32 Dim.33 Dim.34 Dim.35
## Variance 0.023 0.022 0.021 0.020 0.020 0.019 0.019
## % of var. 1.459 1.425 1.378 1.322 1.281 1.241 1.222
## Cumulative % of var. 77.099 78.523 79.901 81.223 82.504 83.745 84.967
## Dim.36 Dim.37 Dim.38 Dim.39 Dim.40 Dim.41 Dim.42
## Variance 0.018 0.017 0.017 0.016 0.015 0.015 0.014
## % of var. 1.152 1.092 1.072 1.019 0.993 0.950 0.924
## Cumulative % of var. 86.119 87.211 88.283 89.301 90.294 91.244 92.169
## Dim.43 Dim.44 Dim.45 Dim.46 Dim.47 Dim.48 Dim.49
## Variance 0.014 0.013 0.012 0.011 0.011 0.010 0.010
## % of var. 0.891 0.833 0.792 0.729 0.716 0.666 0.660
## Cumulative % of var. 93.060 93.893 94.684 95.414 96.130 96.796 97.456
## Dim.50 Dim.51 Dim.52 Dim.53 Dim.54
## Variance 0.009 0.009 0.008 0.007 0.006
## % of var. 0.605 0.584 0.519 0.447 0.390
## Cumulative % of var. 98.060 98.644 99.163 99.610 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | -0.580 1.246 0.174 | 0.155 0.098 0.012 | 0.052 0.013
## 2 | -0.376 0.522 0.108 | 0.293 0.350 0.066 | -0.164 0.127
## 3 | 0.083 0.026 0.004 | -0.155 0.099 0.015 | 0.122 0.071
## 4 | -0.569 1.196 0.236 | -0.273 0.304 0.054 | -0.019 0.002
## 5 | -0.145 0.078 0.020 | -0.142 0.083 0.019 | 0.002 0.000
## 6 | -0.676 1.693 0.272 | -0.284 0.330 0.048 | -0.021 0.002
## 7 | -0.191 0.135 0.027 | 0.020 0.002 0.000 | 0.141 0.095
## 8 | -0.043 0.007 0.001 | 0.108 0.047 0.009 | -0.089 0.038
## 9 | -0.027 0.003 0.000 | 0.267 0.291 0.049 | 0.341 0.553
## 10 | 0.205 0.155 0.028 | 0.366 0.546 0.089 | 0.281 0.374
## cos2
## 1 0.001 |
## 2 0.021 |
## 3 0.009 |
## 4 0.000 |
## 5 0.000 |
## 6 0.000 |
## 7 0.015 |
## 8 0.006 |
## 9 0.080 |
## 10 0.052 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test
## breakfast | 0.182 0.504 0.031 3.022 | 0.020 0.007 0.000 0.330 |
## Not.breakfast | -0.168 0.465 0.031 -3.022 | -0.018 0.006 0.000 -0.330 |
## Not.tea time | -0.556 4.286 0.240 -8.468 | 0.004 0.000 0.000 0.065 |
## tea time | 0.431 3.322 0.240 8.468 | -0.003 0.000 0.000 -0.065 |
## evening | 0.276 0.830 0.040 3.452 | -0.409 2.006 0.087 -5.109 |
## Not.evening | -0.144 0.434 0.040 -3.452 | 0.214 1.049 0.087 5.109 |
## lunch | 0.601 1.678 0.062 4.306 | -0.408 0.854 0.029 -2.924 |
## Not.lunch | -0.103 0.288 0.062 -4.306 | 0.070 0.147 0.029 2.924 |
## dinner | -1.105 2.709 0.092 -5.240 | -0.081 0.016 0.000 -0.386 |
## Not.dinner | 0.083 0.204 0.092 5.240 | 0.006 0.001 0.000 0.386 |
## Dim.3 ctr cos2 v.test
## breakfast -0.107 0.225 0.011 -1.784 |
## Not.breakfast 0.099 0.208 0.011 1.784 |
## Not.tea time 0.062 0.069 0.003 0.950 |
## tea time -0.048 0.054 0.003 -0.950 |
## evening 0.344 1.653 0.062 4.301 |
## Not.evening -0.180 0.864 0.062 -4.301 |
## lunch 0.240 0.343 0.010 1.719 |
## Not.lunch -0.041 0.059 0.010 -1.719 |
## dinner 0.796 1.805 0.048 3.777 |
## Not.dinner -0.060 0.136 0.048 -3.777 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## breakfast | 0.031 0.000 0.011 |
## tea.time | 0.240 0.000 0.003 |
## evening | 0.040 0.087 0.062 |
## lunch | 0.062 0.029 0.010 |
## dinner | 0.092 0.000 0.048 |
## always | 0.056 0.035 0.007 |
## home | 0.016 0.002 0.030 |
## work | 0.075 0.020 0.022 |
## tearoom | 0.321 0.019 0.031 |
## friends | 0.186 0.061 0.030 |
plot(mca, invisible=c("ind"), habillage="quali", graph.type = "ggplot")
plotellipses(mca, graph.type = c("ggplot"))
keep_b <- c("Sport", "diuretic", "frequency", "where", "how", "sugar", "slimming", "relaxing", "effect.on.health", "feminine", "sex", "age_Q")
tea_b <- dplyr::select(tea, all_of(keep_b))
mca_b <- MCA(tea_b, graph=F)
summary(mca_b)
##
## Call:
## MCA(X = tea_b, graph = F)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.166 0.154 0.138 0.116 0.106 0.098 0.092
## % of var. 10.473 9.741 8.710 7.331 6.689 6.188 5.788
## Cumulative % of var. 10.473 20.214 28.923 36.255 42.944 49.132 54.920
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
## Variance 0.086 0.077 0.076 0.069 0.065 0.064 0.058
## % of var. 5.451 4.870 4.819 4.385 4.099 4.042 3.678
## Cumulative % of var. 60.372 65.242 70.061 74.445 78.545 82.587 86.265
## Dim.15 Dim.16 Dim.17 Dim.18 Dim.19
## Variance 0.055 0.052 0.042 0.038 0.031
## % of var. 3.491 3.267 2.645 2.385 1.946
## Cumulative % of var. 89.757 93.024 95.669 98.054 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.571 0.655 0.224 | 0.093 0.019 0.006 | 0.055
## 2 | 0.161 0.052 0.023 | 0.023 0.001 0.000 | 0.159
## 3 | 0.248 0.124 0.065 | -0.165 0.058 0.028 | 0.026
## 4 | -0.639 0.822 0.379 | -0.317 0.217 0.093 | 0.069
## 5 | 0.013 0.000 0.000 | -0.004 0.000 0.000 | 0.115
## 6 | -0.559 0.627 0.312 | -0.232 0.117 0.054 | -0.036
## 7 | -0.321 0.207 0.056 | 0.094 0.019 0.005 | -0.083
## 8 | -0.060 0.007 0.002 | -0.380 0.312 0.088 | 0.147
## 9 | -0.084 0.014 0.004 | 0.297 0.191 0.055 | -0.774
## 10 | 0.254 0.129 0.040 | 0.293 0.186 0.053 | -0.612
## ctr cos2
## 1 0.007 0.002 |
## 2 0.061 0.023 |
## 3 0.002 0.001 |
## 4 0.011 0.004 |
## 5 0.032 0.013 |
## 6 0.003 0.001 |
## 7 0.017 0.004 |
## 8 0.053 0.013 |
## 9 1.448 0.375 |
## 10 0.906 0.233 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## Not.sportsman | 0.406 3.338 0.111 5.769 | -0.066 0.095
## sportsman | -0.274 2.256 0.111 -5.769 | 0.045 0.064
## diuretic | 0.406 4.808 0.228 8.253 | 0.039 0.048
## Not.diuretic | -0.561 6.640 0.228 -8.253 | -0.054 0.066
## 1/day | -0.367 2.146 0.062 -4.323 | 0.086 0.126
## 1 to 2/week | -0.698 3.595 0.084 -5.007 | -0.317 0.795
## +2/day | 0.466 4.621 0.159 6.905 | -0.092 0.194
## 3 to 6/week | 0.189 0.203 0.005 1.168 | 0.514 1.618
## chain store | -0.221 1.573 0.087 -5.099 | -0.449 6.958
## chain store+tea shop | 0.497 3.233 0.087 5.098 | 0.367 1.893
## cos2 v.test Dim.3 ctr cos2 v.test
## Not.sportsman 0.003 -0.940 | 0.223 1.214 0.034 3.172 |
## sportsman 0.003 0.940 | -0.151 0.820 0.034 -3.172 |
## diuretic 0.002 0.794 | 0.145 0.741 0.029 2.955 |
## Not.diuretic 0.002 -0.794 | -0.201 1.024 0.029 -2.955 |
## 1/day 0.003 1.010 | 0.091 0.158 0.004 1.069 |
## 1 to 2/week 0.017 -2.270 | 0.808 5.793 0.112 5.796 |
## +2/day 0.006 -1.364 | -0.289 2.140 0.061 -4.285 |
## 3 to 6/week 0.034 3.177 | -0.220 0.331 0.006 -1.358 |
## chain store 0.358 -10.342 | 0.299 3.461 0.159 6.898 |
## chain store+tea shop 0.047 3.762 | -1.114 19.493 0.436 -11.417 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Sport | 0.111 0.003 0.034 |
## diuretic | 0.228 0.002 0.029 |
## frequency | 0.210 0.051 0.139 |
## where | 0.097 0.531 0.476 |
## how | 0.012 0.516 0.596 |
## sugar | 0.288 0.020 0.002 |
## slimming | 0.104 0.028 0.146 |
## relaxing | 0.039 0.117 0.011 |
## effect.on.health | 0.001 0.012 0.065 |
## feminine | 0.252 0.103 0.038 |
plot(mca_b, invisible=c("ind"), habillage="quali", graph.type="ggplot")
plotellipses(mca_b, graph.type = c("ggplot"))
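FactoMineR also offers dimdesc(), which describes each MCA dimension by the variables and categories most associated with it, giving a numerical counterpart to reading the factor maps:

```r
# Describe the first two dimensions of the reduced MCA;
# variables are ranked by the strength of their association with each dimension
dd <- dimdesc(mca_b, axes = 1:2)
dd$`Dim 1`$quali
```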
In this exercise two longitudinal data sets are analysed. The structures of these data sets are shown below. Both data sets have been converted from wide to long format in a separate R-script.
library(dplyr)
library(tidyverse)
library(cowplot)
library(sjPlot)
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt",
sep= "", header=T) #reading this for the pairs
BPRSL <- read.csv('./data/BPRSL.csv', header = T, sep=",")
RATSL <- read.csv('./data/RATSL.csv', header=T, sep=",")
str(RATSL)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : int 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : chr "WD1" "WD1" "WD1" "WD1" ...
## $ Weight: int 240 225 245 260 255 260 275 245 410 405 ...
## $ time : int 1 1 1 1 1 1 1 1 1 1 ...
str(BPRSL)
## 'data.frame': 360 obs. of 6 variables:
## $ treatment: int 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : int 1 2 3 4 5 6 7 8 9 10 ...
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : chr "week0" "week0" "week0" "week0" ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
BPRSL$treatment <- factor(BPRSL$treatment)
BPRSL$subject <- factor(BPRSL$subject)
RATSL$Group <- factor(RATSL$Group)
RATSL$ID <- factor(RATSL$ID)
We begin with a data set of rats. The rats (N = 16) were assigned to three groups, each on a different diet. In this exercise we assess whether the growth of the rats differed between groups, i.e., as a function of their diets.
First, we visualize the individual growth profiles (rats as individuals) in raw scores throughout the weigh-ins.
ggplot(RATSL, aes(time, Weight, color=ID, linetype=ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
scale_color_viridis_d(begin=.01, end=.8) +
facet_wrap(~Group, labeller=label_both) +
theme_bw() +
theme(legend.position ="none")
Next we scale the weight variable, (x - mean(x)) / sd(x), by group and time and plot it again. Here we can see relative growth in terms of group means and standard deviations. After this I plotted the mean summaries by group.
From the scaled growth by time plot, it is quite clear that in each group there is possibly one outlier.
RATSL.s <- RATSL %>%
group_by(Group, time) %>%
mutate(Weight.s = (Weight - mean(Weight))/sd(Weight) ) %>%
ungroup()
ggplot(RATSL.s, aes(time, Weight.s, linetype=ID, color=ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
scale_color_viridis_d(begin=.01, end=.8) +
facet_wrap(~Group, labeller=label_both) +
theme_bw() +
theme(legend.position ="none") +
ylab("Scaled weight ((x - mean(x)) / sd(x))")
RATSL.sb <- RATSL %>%
  group_by(Group, time) %>%
  summarise(mean = mean(Weight), se = sd(Weight)/sqrt(n())) %>% # n() = group size at each time point
  ungroup()
ggplot(RATSL.sb, aes(time, mean, color=Group, linetype=Group, shape=Group)) +
  geom_line() +
  geom_point(size=2.5) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se), linetype=1, width=0.3) +
  scale_color_viridis_d(begin=.01, end=.8) +
  theme_bw() +
  theme(legend.position = c(0.9,0.4)) +
  ylab("Mean weight") +
  labs(title="Means and standard errors of weight by group of rats")
Next we create summary data by individual and group to identify possible outliers. For each group, the boxplot marks one outlier. To get the exact value of the outlier in each group, I simply checked the group minima and maxima from the tibble below. The outliers were 238.9 in group 1, 594.0 in group 2 and 495.2 in group 3. These are filtered out and the boxplots are plotted again without them. Once the outliers are removed, we see that the differences in group means are larger than they first seemed (e.g. between groups 2 and 3).
RATSL8S <- RATSL %>%
filter(time > 1) %>%
group_by(Group, ID) %>%
summarise( mean=mean(Weight) ) %>%
ungroup()
glimpse(RATSL8S)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, 4~
rm1 <- ggplot(RATSL8S, aes(x = Group, y = mean, fill=Group)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4) +
scale_fill_viridis_d(begin=.01, end=.8) +
scale_y_continuous(name = "mean(Weight) by group") +
theme_bw() +
labs(title="With outliers")
RATSL8S %>%
group_by(Group) %>%
summarize(min = min(mean),
max = max(mean)) %>%
ungroup()
## # A tibble: 3 x 3
## Group min max
## <fct> <dbl> <dbl>
## 1 1 239. 276.
## 2 2 444. 594
## 3 3 495. 542.
RATSL8S1 <- filter(RATSL8S, mean != 238.9, mean != 594.0, mean != 495.2)
RATSL8S1$mean
## [1] 263.2 261.7 267.2 270.9 276.2 274.6 267.5 443.9 457.5 455.8 536.4 542.2
## [13] 536.2
rm2 <- ggplot(RATSL8S1, aes(x = Group, y = mean, fill=Group)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4) +
scale_fill_viridis_d(begin=.01, end=.8) +
scale_y_continuous(name = "mean(Weight) by group") +
theme_bw() +
labs(title="Without outliers") +
theme(legend.position = c(0.9,0.4))
cowplot::plot_grid(rm1 +
theme(legend.position = "none"),
rm2, nrow=1)
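Instead of hardcoding the three outlier values, the same filtering can be done programmatically per group with the standard boxplot rule; a sketch:

```r
# Drop, within each group, the observations that boxplot.stats() flags
# as outliers (points beyond 1.5 * IQR from the box)
RATSL8S_auto <- RATSL8S %>%
  group_by(Group) %>%
  filter(!mean %in% boxplot.stats(mean)$out) %>%
  ungroup()
RATSL8S_auto$mean
```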
With the outliers removed, testing the mean differences is more meaningful. In this exercise we were to follow a certain study protocol, and at this point a t-test of group means would have been done. However, I ran a one-way ANOVA instead, as it allows comparing more than two independent groups.
The one-way ANOVA shows that the mean differences between the groups are significant, F(2, 10) = 2836, p < .001. The TukeyHSD test below gives the pairwise group comparisons as mean differences with confidence intervals.
Further, I fit a linear model in which the baseline weight (first weigh-in) of the rats is added as a covariate. From this model it is clear that the baseline measure is what predicts the mean weight, and the group differences are not significant. In other words, it seems that the diets did not predict the mean weight; rather, the baseline weight (i.e., genetics) is what mattered. Alongside the summaries, the marginal effects (model-based predictions) are plotted below, where the association of baseline weight with mean weight can be examined visually.
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt",
sep="", header=T)
ratanova <-aov(mean ~ Group , data = RATSL8S1)
summary(ratanova)
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 176917 88458 2836 1.69e-14 ***
## Residuals 10 312 31
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(ratanova)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = mean ~ Group, data = RATSL8S1)
##
## $Group
## diff lwr upr p adj
## 2-1 183.64286 173.07885 194.20686 0
## 3-1 269.50952 258.94552 280.07353 0
## 3-2 85.86667 73.36717 98.36617 0
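The Tukey comparisons can also be plotted directly; base R has a plot method for TukeyHSD objects that draws the family-wise confidence intervals:

```r
# 95% family-wise confidence intervals for the pairwise group differences;
# intervals not crossing zero indicate significant differences
plot(TukeyHSD(ratanova), las = 1)
```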
RATSL8S2 <- RATSL8S %>%
mutate(baseline = RATS$WD1)
ratlm <- lm(mean ~ baseline + Group, data = RATSL8S2)
summary(ratlm)
##
## Call:
## lm(formula = mean ~ baseline + Group, data = RATSL8S2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.905 -4.194 2.190 7.577 14.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.16375 21.87657 1.516 0.1554
## baseline 0.92513 0.08572 10.793 1.56e-07 ***
## Group2 34.85753 18.82308 1.852 0.0888 .
## Group3 23.67526 23.25324 1.018 0.3287
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.68 on 12 degrees of freedom
## Multiple R-squared: 0.9936, Adjusted R-squared: 0.992
## F-statistic: 622.1 on 3 and 12 DF, p-value: 1.989e-13
anova(ratlm)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 253625 253625 1859.8201 1.57e-14 ***
## Group 2 879 439 3.2219 0.07586 .
## Residuals 12 1636 136
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot_model(ratlm, type = "pred", terms = c("baseline", "Group")) +
scale_fill_viridis_d(option = "D", begin=.01, end=.8) +
scale_color_viridis_d(begin=.01, end=.8) +
labs(title = "") + theme_bw() +
labs(title="Predictions of mean weight by baseline weight")
Next we turn to data in which 40 male subjects were assessed with the Brief Psychiatric Rating Scale (BPRS) over an eight-week period, randomized to two treatment groups of 20 subjects each. In the original data the subjects were numbered from 1 to 20 within each group, creating the problem of false duplicates (i.e., subject 1 exists in both treatment 1 and treatment 2). Therefore, in the data wrangling part in a separate R-file, an identification column ('id') was added to distinguish between subjects in plots and analyses where they were not separated by treatment group.
Unfortunately, information about the treatments in question was not available to me, as Davis (2002), from which the data originates, was not available. But as the BPRS is used e.g. in the assessment of schizophrenia, and the analyses below show that the treatment groups do not differ very much in their outcomes, I suppose both groups received quite similar treatment with some variation.
Below, plot A is an OLS model of the treatment groups, B is a loess model, and C follows the example from the book (which I did not find very meaningful with this data; scatter plots of the BPRS can also be found below, open the picture in a new window to enlarge). At this point I find it interesting to see whether OLS or loess reveals an effect between these treatment groups. After these plots, individual BPRS scores are plotted by treatment group. At this level of analysis it looks like the BPRS scores in treatment group 1 decrease more than in treatment group 2; however, the treatment groups do not differ significantly, as the confidence bands of the two model plots overlap. Let's see what happens when we dig a bit deeper.
#Note for peer-reviewer; BPRS data was read already at the first chunk and factored
glimpse(BPRSL)
## Rows: 360
## Columns: 6
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1~
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1~
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0~
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, ~
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
b1 <- ggplot(BPRSL, aes(x = week, y = bprs, color=treatment)) +
geom_smooth(method="lm") +
scale_color_viridis_d(begin=.01, end=.8)
b2 <- ggplot(BPRSL, aes(x = week, y = bprs, color=treatment)) +
geom_smooth(method="loess") +
scale_color_viridis_d(begin=.01, end=.8)
b3 <- ggplot(BPRSL, aes(x=week, y=bprs, color=treatment)) +
geom_text(aes(label=treatment), show.legend=T) +
scale_color_viridis_d(begin=.01, end=.8)
legend2 <- get_legend(b1 + theme(legend.position = "left"))
cowplot::plot_grid(b1 + theme(legend.position = "none"),
b2 + theme(legend.position = "none"),
b3 + theme(legend.position = "none"),
legend2, nrow=1,
labels=LETTERS[1:3])
ggplot(BPRSL, aes(week, bprs, linetype=subject, color=subject)) +
geom_line(aes(group = subject)) +
geom_point(alpha=0.3) +
scale_linetype_manual(values=rep(1:10, times=4)) +
scale_color_viridis_d() +
facet_grid(~treatment, labeller=label_both) +
theme_bw() +
theme(legend.position ="none")
pairs(BPRS)
Next I create a random intercept model (b_lmer) and a random intercept and slope model (b_lmer2) to account for subject-level differences. It still looks like treatment 1 is more effective than treatment 2 (the treatment2 estimate is 0.57 BPRS points higher in the random intercept model and 1.51 higher in the random slope model, compared to treatment 1).
b_lmer <- lme4::lmer(bprs ~ week + treatment + (1 | id), data = BPRSL, REML = F)
summary(b_lmer)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | id)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2582.9 2602.3 -1286.5 2572.9 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.27506 -0.59909 -0.06104 0.44226 3.15835
##
## Random effects:
## Groups Name Variance Std.Dev.
## id (Intercept) 97.39 9.869
## Residual 54.23 7.364
## Number of obs: 360, groups: id, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.3521 19.750
## week -2.2704 0.1503 -15.104
## treatment2 0.5722 3.2159 0.178
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.256
## treatment2 -0.684 0.000
b_lmer2 <- lme4::lmer(bprs ~ week + treatment + (week | id), data = BPRSL, REML = F)
summary(b_lmer2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | id)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2523.2 2550.4 -1254.6 2509.2 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4655 -0.5150 -0.0920 0.4347 3.7353
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## id (Intercept) 167.827 12.955
## week 2.331 1.527 -0.67
## Residual 36.747 6.062
## Number of obs: 360, groups: id, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 45.9830 2.6470 17.372
## week -2.2704 0.2713 -8.370
## treatment2 1.5139 3.1392 0.482
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.545
## treatment2 -0.593 0.000
anova(b_lmer, b_lmer2)
## Data: BPRSL
## Models:
## b_lmer: bprs ~ week + treatment + (1 | id)
## b_lmer2: bprs ~ week + treatment + (week | id)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## b_lmer 5 2582.9 2602.3 -1286.5 2572.9
## b_lmer2 7 2523.2 2550.4 -1254.6 2509.2 63.663 2 1.499e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
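Because lmerMod summaries omit p-values, one quick way to judge whether the treatment2 estimate differs from zero is a Wald confidence interval (just one option; lme4 also offers profile and bootstrap intervals):

```r
# Wald 95% CIs for the fixed effects of the random slope model;
# a treatment2 interval covering zero matches the small t-value above
confint(b_lmer2, method = "Wald")
```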
Below we compare the random intercept and slope model to the same model with a week * treatment interaction. The anova shows, however, that the interaction model is not a better fit to the data. I plot both b_lmer2 (main effects with random slope) and b_lmer3 (interaction with random slope) below so you can inspect the difference.
b_lmer3 <- lme4::lmer(bprs ~ week * treatment + (week | id), data = BPRSL, REML = F)
summary(b_lmer3)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week * treatment + (week | id)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2523.5 2554.5 -1253.7 2507.5 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4747 -0.5256 -0.0866 0.4435 3.7884
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## id (Intercept) 164.204 12.814
## week 2.203 1.484 -0.66
## Residual 36.748 6.062
## Number of obs: 360, groups: id, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.9840 16.047
## week -2.6283 0.3752 -7.006
## treatment2 -2.2911 4.2200 -0.543
## week:treatment2 0.7158 0.5306 1.349
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.668
## treatment2 -0.707 0.473
## wek:trtmnt2 0.473 -0.707 -0.668
anova(b_lmer3, b_lmer2)
## Data: BPRSL
## Models:
## b_lmer2: bprs ~ week + treatment + (week | id)
## b_lmer3: bprs ~ week * treatment + (week | id)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## b_lmer2 7 2523.2 2550.4 -1254.6 2509.2
## b_lmer3 8 2523.5 2554.6 -1253.7 2507.5 1.78 1 0.1821
Next I plotted the fitted and observed values by treatment group. I also wanted to visualize the difference between the main effects + random slope model and the interaction + random slope model. In the line graphs the difference is almost unnoticeable, but in the forest plots (labeled A and B) it is more apparent. However, I guess it is safe to say that the take-home message from these analyses and visualizations is that patients in treatment group 1 were better off, unless we account for the outlier in treatment group 2. That should be taken into account here, which follows next.
BPRSL2 <- BPRSL
BPRSL2$id <- as.factor(BPRSL2$id)
BPRSL2$Fitted <- fitted(b_lmer2)
colnames(BPRSL2)[5] <- "Observed"
reshaped <- reshape::melt(BPRSL2,
id.vars=c("treatment","subject", "id", "week"),
measure.vars=c("Observed", "Fitted"))
#reshape::melt names the long-format columns 'variable' and 'value'
ggplot(reshaped, aes(x=week, y=value, group=id, color=id)) +
geom_line(aes(linetype=treatment)) +
scale_color_viridis_d(option="D", begin=0.01, end=0.8) +
facet_wrap(~ variable) +
theme(legend.position="right") +
xlab("Week") +
ylab("BPRS Score") +
labs(title="Observed and Fitted BPRS-score with main effects and random slope")
##
BPRSL3 <- BPRSL2
BPRSL3$Fitted <- fitted(b_lmer3)
colnames(BPRSL3)[5] <- "Observed"
reshaped2 <- reshape::melt(BPRSL3,
id.vars=c("treatment","subject", "id", "week"),
measure.vars=c("Observed", "Fitted"))
#reshape::melt names the long-format columns 'variable' and 'value'
ggplot(reshaped2, aes(x=week, y=value, group=id, color=id)) +
geom_line(aes(linetype=treatment)) +
scale_color_viridis_d(option="D", begin=0.01, end=0.8) +
facet_wrap(~ variable) +
theme(legend.position="right") +
xlab("Week") +
ylab("BPRS Score") +
labs(title="Observed and Fitted BPRS-score with interaction and random slope")
f1 <- plot_model(b_lmer2, type = "pred", terms = c("treatment", "week")) +
scale_fill_viridis_d(option = "D", begin=.01, end=.8) +
scale_color_viridis_d(begin=.01, end=.8) +
theme_bw() +
labs(title="Fitted BPRS-score \n with main effects \n and random slope")
f2 <- plot_model(b_lmer3, type = "pred", terms = c("treatment", "week")) +
scale_fill_viridis_d(option = "D", begin=.01, end=.8) +
scale_color_viridis_d(begin=.01, end=.8) +
theme_bw() +
labs(title="Fitted BPRS-score \n with interaction \n and random slope")
f2_l <- get_legend(f2 + theme(legend.position = "right"))
cowplot::plot_grid(f1 + theme(legend.position = "none"),
f2 + theme(legend.position = "none"),
align = 'vh',
f2_l,
nrow=1,ncol=3,
labels=LETTERS[1:2])
From the first plot (observed BPRS scores) it was clear that treatment group 2 contained one outlier, a subject clearly deviating from the others. It may therefore be wise to redo the analyses with this subject removed from the data set. Below follow the same analyses as above, but with id 31 (the identified outlier) removed from the data. Looking at the results now, we see that in all models the t-values are so low that the estimated differences between the treatment groups cannot be distinguished from chance, i.e., the group difference is not statistically significant.
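The outlier was identified visually; as a cross-check, it can also be confirmed numerically, e.g., by ranking the treatment group 2 subjects by their mean BPRS (a sketch using the BPRSL data and dplyr verbs already loaded):

```r
# Rank treatment group 2 subjects by mean BPRS; the deviant subject
# (id 31) should top the list if the visual impression is right
BPRSL %>%
  filter(treatment == 2) %>%
  group_by(id) %>%
  summarise(mean_bprs = mean(bprs)) %>%
  arrange(desc(mean_bprs))
```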
BPRSL_o <- subset(BPRSL, id!="31")
###Observed plot ####
ggplot(BPRSL_o, aes(week, bprs, linetype=subject, color=subject)) +
geom_line(aes(group = subject)) +
geom_point(alpha=0.3) +
scale_linetype_manual(values=rep(1:10, times=4)) +
scale_color_viridis_d() +
facet_grid(~treatment, labeller=label_both) +
theme_bw() +
theme(legend.position ="none") +
labs(title="Observations without outlier in treatment group 2")
### Lmers ####
b_lmer_o <- lme4::lmer(bprs ~ week + treatment + (1 | id), data = BPRSL_o, REML = F)
summary(b_lmer_o) #summarize the outlier-removed model
b_lmer2_o <- lme4::lmer(bprs ~ week + treatment + (week | id), data = BPRSL_o, REML = F)
summary(b_lmer2_o) #summarize the outlier-removed model
anova(b_lmer_o, b_lmer2_o)
## Data: BPRSL_o
## Models:
## b_lmer_o: bprs ~ week + treatment + (1 | id)
## b_lmer2_o: bprs ~ week + treatment + (week | id)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## b_lmer_o 5 2503.7 2522.9 -1246.8 2493.7
## b_lmer2_o 7 2440.8 2467.9 -1213.4 2426.8 66.817 2 3.096e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
b_lmer3_o <- lme4::lmer(bprs ~ week * treatment + (week | id), data = BPRSL_o, REML = F)
summary(b_lmer3_o) #summarize the outlier-removed model
anova(b_lmer3_o, b_lmer2_o)
## Data: BPRSL_o
## Models:
## b_lmer2_o: bprs ~ week + treatment + (week | id)
## b_lmer3_o: bprs ~ week * treatment + (week | id)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## b_lmer2_o 7 2440.8 2467.9 -1213.4 2426.8
## b_lmer3_o 8 2441.0 2471.9 -1212.5 2425.0 1.8482 1 0.174
### plots of fits ####
BPRSL2_o <- BPRSL_o
BPRSL2_o$id <- as.factor(BPRSL2_o$id)
BPRSL2_o$Fitted <- fitted(b_lmer2_o)
colnames(BPRSL2_o)[5] <- "Observed"
reshaped_o <- reshape::melt(BPRSL2_o,
id.vars=c("treatment","subject", "id", "week"),
measure.vars=c("Observed", "Fitted"))
#reshape::melt names the long-format columns 'variable' and 'value'
z1 <- ggplot(reshaped_o, aes(x=week, y=value, group=id, color=id)) +
geom_line(aes(linetype=treatment)) +
scale_color_viridis_d(option="D", begin=0.01, end=0.8) +
facet_wrap(~ variable) +
theme(legend.position="right") +
xlab("Week") +
ylab("BPRS Score") +
labs(title="Observed and Fitted \n BPRS-score with main effects \n and random slope",
subtitle="Outlier removed from \n treatment group 2")
##
BPRSL3_o <- BPRSL2_o
BPRSL3_o$Fitted <- fitted(b_lmer3_o)
colnames(BPRSL3_o)[5] <- "Observed"
reshaped2_o <- reshape::melt(BPRSL3_o,
id.vars=c("treatment","subject", "id", "week"),
measure.vars=c("Observed", "Fitted"))
#reshape::melt names the long-format columns 'variable' and 'value'
z2 <- ggplot(reshaped2_o, aes(x=week, y=value, group=id, color=id)) +
geom_line(aes(linetype=treatment)) +
scale_color_viridis_d(option="D", begin=0.01, end=0.8) +
facet_wrap(~ variable) +
theme(legend.position="right") +
xlab("Week") +
ylab("BPRS Score") +
labs(title="Observed and Fitted \n BPRS-score with interaction \n and random slope",
subtitle="Outlier removed from \n treatment group 2")
z2_l <- get_legend(z2 + theme(legend.position = "right"))
zgrid <- cowplot::plot_grid(z1 + theme(legend.position = "none"),
z2 + theme(legend.position = "none"),
align="hv",
ncol=2,
labels=LETTERS[1:2])
cowplot::plot_grid(zgrid, z2_l, ncol = 2, rel_widths = c(1, .2))
f1_o <- plot_model(b_lmer2_o, type = "pred", terms = c("treatment", "week")) +
scale_fill_viridis_d(option = "D", begin=.01, end=.8) +
scale_color_viridis_d(begin=.01, end=.8) +
theme_bw() +
labs(title="Fitted BPRS-score \n with main effects \n and random slope",
subtitle="Outlier removed from \n treatment group 2")
f2_o <- plot_model(b_lmer3_o, type = "pred", terms = c("treatment", "week")) +
scale_fill_viridis_d(option = "D", begin=.01, end=.8) +
scale_color_viridis_d(begin=.01, end=.8) +
theme_bw() +
labs(title="Fitted BPRS-score \n with interaction \n and random slope",
subtitle="Outlier removed from \n treatment group 2")
f2_o_l <- get_legend(f2_o + theme(legend.position = "right"))
cowplot::plot_grid(f1_o + theme(legend.position = "none"),
f2_o + theme(legend.position = "none"),
f2_o_l,
align = "hv",
nrow=1, ncol=3,
labels=LETTERS[1:2])